Abstract

In this paper, we propose a weight-based feature extraction
approach to reduce the number of features for text classification.
The number of extracted features is equal to the
number of document classes and the feature values are ob- 
tained according to the distributions of words over class 
partitions. Each word of the original word set contributes a 
weight to each extracted feature and a transformation ma- 
trix is formed. By using the transformation matrix, the orig- 
inal document set is converted to a new set with a smaller 
number of features. The proposed approach has two ad- 
vantages. Trial-and-error for determining the appropriate 
number of extracted features can be avoided. Computation
demand is small and the method runs fast. Experimental 
results obtained from real-world data sets have shown that 
our method can perform better than other methods. 



1. Introduction 

In this paper, a new feature reduction approach for doc- 
ument data is proposed. Recently, text data processing ap- 
proaches have attracted more and more attention. These ap- 
proaches have to deal with an important problem of a large 
number of features. For example, two real-world data sets, 
20 Newsgroups and Reuters-21578 Top 10, both have more
than 15,000 features. Such high dimensionality is a severe 
obstacle for classification algorithms [1]. To alleviate this 
difficulty, feature reduction approaches are applied before 
document classification tasks are performed.

Two major approaches, feature selection and feature ex- 
traction, have been proposed for feature reduction. The fea- 
ture selection methods select a subset of the original fea- 
tures and the classifier only uses the subset instead of all 
the original features to perform the text classification task. 
A well-known feature selection approach is based on Infor- 
mation Gain [3], which is an information-theoretic measure 
defined by the amount of reduced uncertainty given a piece 
of information. The feature extraction methods convert the 
representation of the original documents to a new representation
based on a smaller set of synthesized features. Word clustering
[4]-[8] is one of the effective techniques for feature extraction.
The idea of word clustering is to group words with a high degree
of pairwise semantic relatedness into clusters. Each word cluster
is then treated as a single feature, so feature dimensionality can
be drastically reduced.

The first feature extraction method based on word clustering
was suggested by Baker and McCallum [4], derived from the
'distributional clustering' idea of Pereira et al. [7]. An
Information Bottleneck approach was proposed by Tishby et al.
[5][6], who showed that word clustering approaches are more
effective than feature selection ones. A Divisive
Information-Theoretic method was proposed by Dhillon et al. [8],
which is more effective than other word clustering methods.
However, both information gain and word clustering based methods
use only a part of the original words to generate new features.
For information gain based methods, only a subset of the original
words is used; for word clustering based methods, each new
feature is generated by combining a subset of the original words.
Such methods ignore useful information that may be provided by
the unused words.

In this paper, we propose a weight-based feature extrac- 
tion approach to reduce the number of features for text clas- 
sification. The number of extracted features is equal to the 
number of document classes and the feature values are ob- 
tained according to the distributions of words over class 
partitions. Each word of the original word set contributes 
a weight to each extracted feature and a transformation ma- 
trix is formed. By using the transformation matrix, the orig- 
inal document set is converted to a new set with a smaller 
number of features. The proposed approach has two ad- 
vantages. Trial-and-error for determining the appropriate 
number of extracted features can be avoided. Computation 
demand is small and the method runs fast. Experimental results
obtained from two real-world data sets, 20 Newsgroups and
Reuters-21578 Top 10, have shown that our method can
perform better than information gain and word clustering 
based methods. 

2. Background and related work

To process documents, the bag-of-words model [2] is
usually used. Let $d_i$ be a document and the set
$D = \{d_1, d_2, \ldots, d_n\}$ represent $n$ documents. Let the word set
$W = \{w_1, w_2, \ldots, w_m\}$ be the feature set of the documents.
Each document $d_i$, $1 \le i \le n$, can be represented as
$d_i = \langle w_{i1}, w_{i2}, \ldots, w_{im} \rangle$, where each $w_{ij}$
denotes the number of occurrences of $w_j$ in document $d_i$.
The feature reduction task is to find a new word set
$W' = \{w'_1, w'_2, \ldots, w'_k\}$, $k < m$, such that $W$ and $W'$ work
equally well for all the desired properties with $D$. After
feature reduction, each document $d_i$ is converted to a new
representation $d'_i = \langle w'_{i1}, w'_{i2}, \ldots, w'_{ik} \rangle$ and the
converted document set is $D' = \{d'_1, d'_2, \ldots, d'_n\}$. If $k$ is very
much smaller than m, computation cost can be drastically 
reduced. 
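
As a concrete illustration of this representation, the following minimal Python sketch builds the count vectors of a small corpus. The helper name `bag_of_words`, the whitespace tokenization, the alphabetical ordering of the word set, and the toy documents are all assumptions made for this example only.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (words, rows): the sorted word set W and one count vector per document."""
    tokenized = [text.lower().split() for text in texts]
    words = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # Entry j is the number of occurrences of word w_j in this document.
        rows.append([counts[w] for w in words])
    return words, rows


# Toy corpus (hypothetical).
texts = ["goal match goal", "stock market", "match stock goal"]
words, D = bag_of_words(texts)
print(words)
for row in D:
    print(row)
```

Here n = 3 and m = 4; feature reduction would replace each length-m row by a length-k row with k < m.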

2.1. Feature selection

In feature selection approaches [3], the new feature set
$W' = \{w'_1, w'_2, \ldots, w'_k\}$ is a subset of the original features.
This approach only uses the selected features as inputs for
classification tasks.

Information Gain is frequently employed in the feature
selection approach. It measures the reduced uncertainty
by an information-theoretic measure and gives each word
a weight. The bigger the weight of a word is, the larger the
reduced uncertainty by the word is. Let $\{c_1, c_2, \ldots, c_p\}$
denote the set of classes. The weight of a word $w_i$ is
calculated as follows:

$$G(w_i) = -\sum_{l=1}^{p} \Pr(c_l)\log\Pr(c_l) + \Pr(w_i)\sum_{l=1}^{p} \Pr(c_l \mid w_i)\log\Pr(c_l \mid w_i) + \Pr(\bar{w}_i)\sum_{l=1}^{p} \Pr(c_l \mid \bar{w}_i)\log\Pr(c_l \mid \bar{w}_i). \tag{1}$$

The words with the top $k$ weights in $W$ are selected as the features.

2.2. Feature extraction

Unlike feature selection, feature extraction combines the
original features to generate new features. For example, the
word clustering based feature extraction methods combine
the words of a subset of the original features into a new
feature.

The word clustering methods proposed in [4]-[8] are
"hard" clustering methods where each word of the original
features belongs to only one word cluster. Therefore each
word contributes to the synthesis of only one new feature.
Each new feature is obtained by summing up the words
belonging to one cluster. The new feature set
$W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition
$\{W_1, W_2, \ldots, W_k\}$ of $W$, i.e., $W_j \cap W_q = \emptyset$, where
$1 \le q, j \le k$ and $j \ne q$. Note that a cluster is equivalent to
an element in the partition.

The distributional word clustering method calculates the
distributions of words over classes, $P(C \mid w_i)$, $1 \le i \le m$,
where $C = \{c_1, c_2, \ldots, c_p\}$ and $p$ is the number of class
labels, and uses Kullback-Leibler divergence to measure the
dissimilarity between two distributions. The distribution of
a cluster $W_j$ is calculated as follows:

$$P(C \mid W_j) = \sum_{w_t \in W_j} \frac{P(w_t)}{P(W_j)} P(C \mid w_t), \quad P(W_j) = \sum_{w_t \in W_j} P(w_t). \tag{2}$$

The goal of distributional word clustering is to minimize the
following objective function:

$$\sum_{j=1}^{k} \sum_{w_t \in W_j} P(w_t)\, KL\big(P(C \mid w_t), P(C \mid W_j)\big), \tag{3}$$

which takes the sum over all the $k$ clusters.



3. Proposed method 

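Let $D$ be the matrix consisting of all the original documents
with $m$ features and $D'$ be the matrix consisting of the
converted documents with $k$ new features. The feature
reduction task can be written in the following form:

$$D' = DT, \tag{4}$$

where

$$D = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}, \quad D' = \begin{bmatrix} d'_1 \\ d'_2 \\ \vdots \\ d'_n \end{bmatrix}, \quad T = \begin{bmatrix} t_{11} & \cdots & t_{1k} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mk} \end{bmatrix}.$$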

Our goal is to find a transformation matrix T to convert D 
to D' in a desirable way. 
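
To make the shapes in equation (4) concrete, the sketch below multiplies an n x m document matrix by an m x k transformation matrix. The helper `matmul` and every number in `D` and `T` are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
def matmul(A, B):
    """Multiply an n x m matrix by an m x k matrix (both given as lists of rows)."""
    assert all(len(row) == len(B) for row in A), "inner dimensions must agree"
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# n = 2 documents over m = 3 original word features (hypothetical counts).
D = [[2, 0, 1],
     [0, 3, 1]]

# m x k transformation matrix with k = 2 new features (hypothetical weights).
T = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]

D_new = matmul(D, T)  # n x k matrix of converted documents
print(D_new)
```

Each converted document keeps one value per new feature, so the row length drops from m to k.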

Document classification is a supervised task in which each
document of the training data set has a given class label. Since
the class labels are known, we can intuitively synthesize a new
feature to distinguish the documents of one class from the
documents of the other classes. It also seems reasonable that a
word is closely related to a class if the word occurs more
frequently in the documents of that class. With this motivation,
we propose a new approach to feature extraction for text
classification. Firstly, we assume that we can synthesize new
features from the original features to distinguish one class from
another, and that the number of new features is equal to the
number of classes. Secondly, we assume that new features can be
obtained by considering the degrees to which the original words
are related to the classes. Based on these ideas, we generate a
transformation matrix in which each element denotes the weight
that one of the original features contributes to a new feature,
and each new feature corresponds to a certain class. The weight
for a word is large if the word occurs frequently in the
documents of the class corresponding to the new feature.
Conversely, the weight is small if the word occurs less
frequently in the documents of that class. Formally, we define
the elements of the transformation matrix T in equation (4) as
follows:



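$$t_{ij} = \Pr(c_j \mid w_i), \tag{5}$$

where $t_{ij}$ is the element in row $i$ and column $j$ of $T$, and

$$\Pr(c_j \mid w_i) = \frac{\text{\# of occurrences of } w_i \text{ in class } c_j}{\text{\# of occurrences of } w_i \text{ in all classes}}. \tag{6}$$

Thus, each new feature value $w'_{rj}$ of document $d_r$ can be
calculated as follows:

$$w'_{rj} = \sum_{i=1}^{m} w_{ri} \times \Pr(c_j \mid w_i). \tag{7}$$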



Our method works in a straightforward way. Firstly, 
we calculate the probabilities of words over classes. 
Then, we generate the transformation matrix. Finally, 
we use the transformation matrix to convert documents 
from the original features to new features. For clarity,
the algorithm of our method is summarized below:



Input: D is the set of documents, W is the set of words, C
is the set of classes, l is the number of classes, and
m is the number of words.

Output: D' is the set of converted documents.

1. For each word $w_i \in W$ and each class $c_j \in C$,
$1 \le i \le m$ and $1 \le j \le l$, calculate $\Pr(c_j \mid w_i)$
by equation (6).

2. Obtain transformation matrix T by equation (5).

3. Convert document set D to new document set D'
by equation (4).



After transformation of documents is done, we can per- 
form the classification task with the converted data instead 
of the original data. Computing the transformation matrix
amounts to estimating the conditional probability of each
class given each word. This probability estimation has a time
complexity proportional to the number of documents. Our
method has two advantages. Trial-and-error for determining 
the appropriate number of extracted features can be avoided. 
Computation demand is small and the method runs fast. 
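
The three steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the integer encoding of class labels, the zero weight given to words that never occur, and the toy data are all hypothetical.

```python
def class_conditional_weights(docs, labels, num_classes):
    """Estimate T[i][j] = Pr(c_j | w_i) from word counts, as in equations (5) and (6)."""
    m = len(docs[0])
    # occ[i][j] = number of occurrences of word w_i in documents of class c_j
    occ = [[0] * num_classes for _ in range(m)]
    for row, label in zip(docs, labels):
        for i, count in enumerate(row):
            occ[i][label] += count
    T = []
    for i in range(m):
        total = sum(occ[i])
        # Assumption: a word that never occurs gets zero weight for every class.
        T.append([c / total if total else 0.0 for c in occ[i]])
    return T


def transform(docs, T):
    """Convert documents via D' = D T, i.e. equations (4) and (7)."""
    k = len(T[0])
    return [[sum(row[i] * T[i][j] for i in range(len(row))) for j in range(k)]
            for row in docs]


# Toy training data: 4 documents, 3 words, 2 classes (hypothetical).
D = [[2, 0, 1],
     [1, 0, 1],
     [0, 3, 1],
     [0, 1, 1]]
labels = [0, 0, 1, 1]

T = class_conditional_weights(D, labels, num_classes=2)
D_new = transform(D, T)
print(T)
print(D_new)
```

In this toy case the first word occurs only in class 0 and the second only in class 1, so they contribute entirely to one new feature each, while the third word is split evenly. Counting occurrences takes a single pass over the documents, which matches the linear dependence on the number of documents noted above.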

4. Experiments and Results 

To show the effectiveness of our proposed method, experiments
are performed on two well-known data sets for text
classification research, 20 Newsgroups (20NG) and Reuters-21578.
Experiment 1 works on the 20 Newsgroups corpus, which contains
about 20000 articles taken from the Usenet newsgroups. These
articles are evenly distributed over 20 categories; each category
has about 1000 articles. We use two-thirds of the documents for
training and the rest for testing. The documents of Reuters-21578
are divided, according to the "ModApte" split, into 9603 training
documents and 3299 testing documents. Unlike 20 Newsgroups, the
distribution of documents in Reuters-21578 is skewed. The number
of training documents per class varies from 1 to about 4000, with
the top 10 classes containing [illegible] of the documents and 28
classes having fewer than 10 training documents. Experiment 2
uses the documents of the top 10 classes. The number of words
involved in Experiment 1 and Experiment 2 is 25718 and 16285,
respectively. To demonstrate the classification capability of
the reduced features, we choose the
Naive Bayes classifier to do text classification. We compare 
our method with other methods on the classification accu- 
racy and running speed. 
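
The evaluation step can be sketched as follows. The text does not specify which Naive Bayes variant is used, so this example assumes a multinomial model with Laplace smoothing applied directly to the (possibly fractional) reduced feature values; the function names, the smoothing constant `alpha`, and the toy data are hypothetical. The sketch also assumes every class has at least one training document.

```python
import math


def train_nb(X, y, num_classes, alpha=1.0):
    """Fit multinomial Naive Bayes; alpha is an assumed Laplace smoothing constant."""
    num_features = len(X[0])
    class_docs = [0] * num_classes
    feature_mass = [[0.0] * num_features for _ in range(num_classes)]
    for row, label in zip(X, y):
        class_docs[label] += 1
        for j, value in enumerate(row):
            feature_mass[label][j] += value
    log_prior = [math.log(n / len(X)) for n in class_docs]
    log_likelihood = []
    for c in range(num_classes):
        total = sum(feature_mass[c]) + alpha * num_features
        log_likelihood.append([math.log((v + alpha) / total) for v in feature_mass[c]])
    return log_prior, log_likelihood


def predict_nb(model, row):
    """Return the class with the highest posterior log-score for one document."""
    log_prior, log_likelihood = model
    scores = [lp + sum(v * ll for v, ll in zip(row, lls))
              for lp, lls in zip(log_prior, log_likelihood)]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(model, X, y):
    """Fraction of documents whose predicted class equals the true class."""
    return sum(predict_nb(model, row) == label for row, label in zip(X, y)) / len(y)


# Hypothetical reduced documents (2 features, one per class) and their labels.
X_train = [[2.5, 0.5], [1.5, 0.5], [0.5, 3.5], [0.5, 1.5]]
y_train = [0, 0, 1, 1]
X_test = [[3.0, 1.0], [0.2, 2.0]]
y_test = [0, 1]

model = train_nb(X_train, y_train, num_classes=2)
print(accuracy(model, X_test, y_test))
```

Because the reduced representation has only as many features as classes, both training and prediction touch very few values per document.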

4.1. Experiment 1: 20 Newsgroup Data 

Table 1 and Table 2 show the classification accuracy (%)
and execution time (sec), respectively, on the 20 Newsgroups data
set obtained by our method, the Divisive Clustering (DC) based
feature extraction method, and the Information Gain (IG) based
feature selection method. Note that the 20 Newsgroups data set
contains 25718 features.



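Accuracy (%) of our method with 20 features: 88.18

Accuracy (%) of DC and IG for varying numbers of features (the
method labels and feature counts of this table are not
recoverable; the surviving cells are listed row by row):

37.02 | 54.16 | 78.54 | 83.6
45.38 | 63.14 | 73.76 | 84.65 | 88.40
  .05 | 89.20 | 88.40

Table 1. Accuracy (%) of three approaches on
20 Newsgroups data with 1/3-2/3 test-training split.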



Execution time (sec) of our method: 1300.8
Execution time (sec) of IG: 1337.3

Execution time (sec) of DC for varying numbers of features (the
feature counts of this table are not recoverable; the surviving
cells are listed row by row):

1442.0 | 1565.1 | 1698.1 | 1817.3 | 2178.5
2799.1 | 4255.5
17108.4 | 47235.6

Table 2. Execution time (sec) of three approaches on
20 Newsgroups data with 1/3-2/3 test-training split.



As shown in these tables, our method achieves 88.18%
accuracy with 20 features in 1300.8 seconds. The accuracy
is just 0.22% lower than that achieved by a full-feature
Naive Bayes classifier (88.40%). With 20 features,
DC achieves 78.54% accuracy in 1817.3 seconds and IG
achieves 18.34% accuracy in 1337.3 seconds. DC is better
than ours in accuracy only when the number of features
reaches 5000, and it then spends much more time than
ours. For example, DC achieves 89.2% with 5000 features
in 47235.6 seconds. IG is better than ours only when the
full feature set is used. Our method achieves almost the best
accuracy that DC or IG can achieve, but in much less time.

4.2. Experiment 2: Reuters-21578 Top 10 Data

Table 3 and Table 4 show the classification accuracy (%)
and execution time (sec), respectively, on the Reuters-21578 Top
10 data set obtained by our method, the Divisive Clustering (DC)
based feature extraction method, and the Information Gain (IG)
based feature selection method. Note that this data set
contains 16285 features.



Accuracy (%) of our method with 10 features: 83.75

Method | Number of features
       |     2 |     5 |    10 |    20 |    50 |   100
DC     | 49.00 | 76.50 | 80.20 | 81.28 | 83.72 | 82.89
IG     | 41.54 | 52.33 | 55.06 | 57.35 | 62.23 | 68.54

Method | Number of features
       |   200 |   500 |  1000 |  5000 | 16285
DC     | 83.86 | 84.47 | 84.33 | 84.43 | 86.27
IG     | 72.53 | 83.72 | 84.65 | 86.41 | 86.27

Table 3. Accuracy (%) of three approaches on
Reuters-21578 Top 10 data.



Execution time (sec) of our method: 492.1
Execution time (sec) of Information Gain: 513.6

Method | Number of features
       |      2 |      5 |     10 |     20 |      50
DC     |      - |      - |      - |      - |       -

Method | Number of features
       |    100 |    200 |    500 |   1000 |    5000
DC     | 1100.3 | 1898.9 | 3418.5 | 5151.1 | 17781.2

Table 4. Execution time (sec) of three approaches on
Reuters-21578 Top 10 data.

As shown in these tables, our method achieves 83.75%
accuracy with 10 features in 492.1 seconds and the accuracy
is just 2.52% lower than the accuracy achieved by a full-feature
Naive Bayes classifier (86.27%). With 10 features, DC
achieves 80.2% accuracy in 563.3 seconds and IG achieves
55.06% accuracy in 513.6 seconds. DC is better than ours
only when the number of features reaches 200, but it then
spends much more time (1898.9 seconds or more). DC
achieves 84.47% with 500 features in 3814.5 seconds and
IG achieves 86.41% with 5000 features in 513.6 seconds.
Our method can achieve very good accuracy with much less 
time than DC and IG. 
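
The accuracy gaps and time ratios quoted in the two experiments can be re-derived from the reported figures. The short sketch below does this arithmetic; the helper name `summarize` is hypothetical, and the DC timings used are those at the smallest feature count where DC overtakes our method (5000 features for 20 Newsgroups, 200 features for Reuters-21578 Top 10).

```python
def summarize(ours_acc, full_acc, ours_sec, dc_sec):
    """Return (accuracy gap to full-feature Naive Bayes, DC time / our time)."""
    gap = round(full_acc - ours_acc, 2)
    ratio = round(dc_sec / ours_sec, 1)
    return gap, ratio


# Figures quoted in the text of Sections 4.1 and 4.2.
newsgroups = summarize(88.18, 88.40, 1300.8, 47235.6)
reuters = summarize(83.75, 86.27, 492.1, 1898.9)
print(newsgroups, reuters)
```

On these figures DC needs roughly 36 times our running time on 20 Newsgroups and roughly 4 times on Reuters-21578 Top 10 before it matches our accuracy.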

5. Conclusion 

We have proposed a weight-based feature extraction ap- 
proach for document classification. The number of ex- 
tracted features is equal to the number of document classes 
and the feature values are obtained according to the distri- 
butions of words over class partitions. The proposed ap- 
proach has two advantages. Trial-and-error for determining 
the appropriate number of extracted features can be avoided. 
Computation demand is small and the method runs fast. Ex- 
perimental results obtained from two real-world data sets, 
20 Newsgroups (20NG) and Reuters-21578, have shown that
our method can achieve very good classification accuracy 
in much less time than the divisive clustering based feature 
extraction method and the information gain based feature 
selection method. 

References 

[1] F. Sebastiani, Machine Learning in Automated Text
Categorization. ACM Computing Surveys, Vol. 34, No. 1, March
2002, pp. 1-47.

[2] G. Salton and M. J. McGill, Introduction to Modern
Information Retrieval. McGraw-Hill Book Company, 1983.

[3] Y. Yang and J. O. Pedersen, A comparative study on feature
selection in text categorization. In Proceedings of 14th
International Conference on Machine Learning. Morgan Kaufmann,
1997, pp. 412-420.

[4] L. D. Baker and A. McCallum, Distributional clustering of
words for text classification. In Proceedings of 21st Annual
International ACM SIGIR, 1998, pp. 96-103.

[5] N. Slonim and N. Tishby, The power of word clusters for
text classification. In Proceedings of 23rd European
Colloquium on Information Retrieval Research (ECIR), 2001.

[6] R. Bekkerman, R. El-Yaniv, N. Tishby and Y. Winter,
Distributional Word Clusters vs. Words for Text Categorization.
Journal of Machine Learning Research 1, 2002, pp. 1-48.

[7] F. Pereira, N. Tishby and L. Lee, Distributional clustering
of English words. In 31st Annual Meeting of ACL, 1993,
pp. 183-190.

[8] I. S. Dhillon, S. Mallela and R. Kumar, A Divisive
Information-Theoretic Feature Clustering Algorithm for Text
Classification. Journal of Machine Learning Research 3,
2003, pp. 1265-1287.



